-
Interactive visualization interfaces enable users to efficiently explore, analyze, and make sense of their datasets. However, as data grows in size, it becomes increasingly challenging to build data interfaces that meet the interface designer's desired latency expectations and resource constraints. Cloud DBMSs, while optimized for scalability, often fail to meet latency expectations, necessitating complex, bespoke query execution and optimization techniques for data interfaces. This involves manually navigating a huge optimization space that is sensitive to interface design and resource constraints, such as client vs. server data and compute placement, choosing which computations are done offline vs. online, and selecting from a large library of visualization-optimized data structures. This paper advocates for a Physical Visualization Design (PVD) tool that decouples interface design from system design to provide design independence. Given an interface's underlying data flow, interactions with latency expectations, and resource constraints, PVD checks if the interface is feasible and, if so, proposes and instantiates a middleware architecture spanning the client, server, and cloud DBMS that meets the expectations. To this end, this paper presents Jade, the first prototype PVD tool that enables design independence. Jade proposes an intermediate representation called Diffplans to represent the data flows, develops cost estimation models that trade off between latency guarantees and plan feasibility, and implements an optimization framework to search for the middleware architecture that meets the guarantees. We evaluate Jade on six representative data interfaces, comparing against Mosaic and Azure SQL Database. We find that Jade supports a wider range of interfaces, makes better use of available resources, and can meet a wider range of data, latency, and resource conditions.
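
As a rough intuition for the kind of search such a tool automates, the sketch below brute-forces one plan placement per interaction so that every latency expectation is met within a client memory budget. The interactions, candidate plans, cost numbers, and the feasible_architectures helper are hypothetical illustrations, not Jade's Diffplans or cost models.

# A hypothetical, heavily simplified feasibility search for a PVD-style tool.
from itertools import product

# (interaction, latency budget in ms)
interactions = [("brush-filter", 100), ("zoom", 150), ("tooltip", 50)]

# Candidate plans per interaction: (plan name, estimated latency ms, client memory MB).
candidates = {
    "brush-filter": [("client data cube", 20, 800), ("server index", 90, 0), ("cloud scan", 400, 0)],
    "zoom":         [("client tiles", 30, 300), ("server index", 120, 0)],
    "tooltip":      [("client cache", 5, 50), ("cloud lookup", 200, 0)],
}

CLIENT_MEMORY_MB = 1000

def feasible_architectures():
    """Yield every assignment of plans that fits memory and meets all latency budgets."""
    names = [name for name, _ in interactions]
    for combo in product(*(candidates[n] for n in names)):
        if sum(plan[2] for plan in combo) > CLIENT_MEMORY_MB:
            continue  # exceeds client memory budget
        if all(plan[1] <= budget for plan, (_, budget) in zip(combo, interactions)):
            yield dict(zip(names, (plan[0] for plan in combo)))

for arch in feasible_architectures():
    print(arch)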
-
Successful supervised learning models rely on predictive features, which rarely come from a single dataset. As a result, relevant datasets need to be integrated before training the actual model. This raises a natural question: "how can one efficiently search for predictive features from relevant datasets for integration with responsible AI guarantees?" This paper formalizes the question as the data augmentation search problem, with the objective of minimizing search latency. We propose \sys, an interactive system that takes a supervised learning task as input and searches for a set of join-compatible datasets that optimally improve the performance of the task. Specifically, \sys manages a corpus of relational datasets, uses linear regression as a proxy model to evaluate augmentation candidates, and applies factorized machine learning to accelerate model training and evaluation algorithmically. Furthermore, \sys leverages system and hardware optimizations to maximize parallelism across augmentation searches. These allow \sys to search for a good augmentation plan over 1 million datasets with a latency of 1.4 seconds.
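
As a rough sketch of proxy-model scoring (assuming pandas and scikit-learn), the snippet below joins each candidate dataset to the base table and scores the gain of a linear-regression proxy. The proxy_score and rank_augmentations helpers, the join key, and the candidate dictionary are hypothetical; the system itself avoids materializing these joins by using factorized machine learning.

# A minimal sketch of linear-regression proxy scoring for augmentation candidates.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def proxy_score(df: pd.DataFrame, target: str) -> float:
    """Cross-validated R^2 of a cheap linear-regression proxy over numeric columns."""
    X = df.drop(columns=[target]).select_dtypes("number")
    y = df[target]
    return cross_val_score(LinearRegression(), X, y, cv=3, scoring="r2").mean()

def rank_augmentations(base: pd.DataFrame, candidates: dict, join_key: str, target: str):
    """Score each join-compatible candidate by how much it improves the proxy."""
    baseline = proxy_score(base, target)
    gains = {}
    for name, cand in candidates.items():
        joined = base.merge(cand, on=join_key, how="left").fillna(0.0)
        gains[name] = proxy_score(joined, target) - baseline
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)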
-
Dashboards are vital in modern business intelligence tools, providing non-technical users with an interface to access comprehensive business data. With the rise of cloud technology, there is an increased number of data sources to provide enriched contexts for various analytical tasks, leading to a demand for interactive dashboards over a large number of joins. Nevertheless, joins are among the most expensive operations in DBMSes, making the support of interactive dashboards over joins challenging. In this paper, we present Treant, a dashboard accelerator for queries over large joins. Treant uses factorized query execution to handle aggregation queries over large joins, which alone is still insufficient for interactive speeds. To address this, we exploit the incremental nature of user interactions using the Calibrated Junction Hypertree (CJT), a novel data structure that applies lightweight materialization of the intermediates during factorized execution. CJT ensures that the work needed to compute a query is proportional to how different it is from the previous query, rather than to the overall complexity. Treant manages CJTs to share work between queries and performs materialization offline or during user think times. Implemented as a middleware that rewrites SQL, Treant is portable to any SQL-based DBMS. Our experiments on single-node and cloud DBMSes show that Treant improves dashboard interactions by two orders of magnitude and provides a 10x improvement for ML augmentation compared to a state-of-the-art factorized ML system.
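
The core of factorized aggregation is pushing partial aggregation below the join so that only small partial aggregates are joined. The snippet below illustrates that rewrite on a toy star schema in DuckDB; the table and column names are invented, and the CJT machinery (calibration, incremental reuse across interactions) is not shown.

# A minimal sketch of aggregation pushdown, the basic idea behind factorized execution.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE sales(store_id INT, amount DOUBLE)")
con.execute("CREATE TABLE stores(store_id INT, region TEXT)")
con.execute("INSERT INTO sales VALUES (1, 10.0), (1, 5.0), (2, 7.5)")
con.execute("INSERT INTO stores VALUES (1, 'east'), (2, 'west')")

# Naive form: aggregate over the fully materialized join.
naive = """
SELECT st.region, SUM(s.amount) AS total
FROM sales s JOIN stores st USING (store_id)
GROUP BY st.region
"""

# Factorized form: aggregate each relation first, then join the much smaller
# partial aggregates (a lightweight "message" per join key).
factorized = """
WITH sales_agg AS (
    SELECT store_id, SUM(amount) AS partial_total
    FROM sales GROUP BY store_id
)
SELECT st.region, SUM(a.partial_total) AS total
FROM sales_agg a JOIN stores st USING (store_id)
GROUP BY st.region
"""

assert sorted(con.execute(naive).fetchall()) == sorted(con.execute(factorized).fetchall())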
-
Recent platforms utilize ML task performance metrics, not metadata keywords, to search large data corpora. Requesters provide an initial dataset, and the platform searches for additional datasets that augment (join or union) the requester's dataset to most improve the model's (e.g., linear regression) performance. Although effective, current task-based data searches are stymied by (1) high latency, which deters users, (2) privacy concerns under regulatory standards, and (3) low data quality, which yields low utility. We introduce Mileena, a fast, private, and high-quality task-based dataset search platform. At its heart, Mileena is built on pre-computed semi-ring sketches for efficient ML training and evaluation. Building on semi-rings, we develop a novel Factorized Privacy Mechanism that makes the search differentially private and scales to arbitrary corpus sizes and numbers of requests without major quality degradation. We also demonstrate early promise in using LLM-based agents for automatic data transformation and in applying semi-rings to support causal discovery and treatment effect estimation.
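
The semi-ring sketch idea can be illustrated with the sufficient statistics of linear regression (count, sum, and gram matrix), which can be computed per partition or per join key and combined additively. The function names below are hypothetical, and the sketch omits the differential-privacy noise that the Factorized Privacy Mechanism would add.

# A minimal sketch of a covariance/gram-matrix semi-ring for linear regression.
import numpy as np

def sketch(X: np.ndarray) -> dict:
    """Sufficient statistics (count, column sums, gram matrix) of a data block."""
    return {"c": X.shape[0], "s": X.sum(axis=0), "Q": X.T @ X}

def combine(a: dict, b: dict) -> dict:
    """Semi-ring 'addition': sketches of disjoint partitions add component-wise."""
    return {"c": a["c"] + b["c"], "s": a["s"] + b["s"], "Q": a["Q"] + b["Q"]}

def fit_from_sketch(sk: dict, target_col: int) -> np.ndarray:
    """Least-squares fit of one column on the rest, using only the sketch (no intercept)."""
    Q = sk["Q"]
    keep = [i for i in range(Q.shape[0]) if i != target_col]
    XtX, Xty = Q[np.ix_(keep, keep)], Q[keep, target_col]
    return np.linalg.solve(XtX, Xty)

X = np.random.default_rng(0).normal(size=(100, 4))
full = sketch(X)
split = combine(sketch(X[:60]), sketch(X[60:]))
assert np.allclose(full["Q"], split["Q"])      # partition-invariant
print(fit_from_sketch(split, target_col=3))    # coefficients without revisiting raw rows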
-
Building interactive data interfaces is hard because the design of an interface depends on the data processing needs for the underlying analysis task, yet we do not have a good representation for analysis tasks. To fill this gap, this paper advocates for a Data Interface Grammar (DIG) as an intermediate representation of analysis tasks. We show that DIG is compatible with existing data engineering practices, compact to represent any analysis, simple to translate into an interface design, and amenable to offline analysis. We further illustrate the potential benefits of this abstraction, such as automatic interface generation, automatic interface backend optimization, tutorial generation, and workload generation.
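
To make the idea concrete, one can imagine an analysis task expressed as a parameterized query plus bindings from parameters to interactions, from which an interface and its backend plan could be derived. The dictionary schema and field names below are a hypothetical illustration, not the grammar defined in the paper.

# A hypothetical, DIG-like task representation (illustrative only).
dig_task = {
    "query": """
        SELECT pickup_hour, COUNT(*) AS trips
        FROM taxi
        WHERE trip_distance BETWEEN :lo AND :hi
        GROUP BY pickup_hour
    """,
    "parameters": {
        "lo": {"type": "float", "domain": [0.0, 30.0]},
        "hi": {"type": "float", "domain": [0.0, 30.0]},
    },
    # Each interaction binds query parameters to a widget, so an interface
    # (and the offline analysis of its backend) can be derived mechanically.
    "interactions": [
        {"widget": "range-slider", "binds": ["lo", "hi"], "latency_ms": 100},
    ],
    "output": {"mark": "bar", "x": "pickup_hour", "y": "trips"},
}

print(dig_task["query"].strip())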
-
Large language models (LLMs) are being increasingly deployed as part of pipelines that repeatedly process or generate data of some sort. However, a common barrier to deployment is the frequent and often unpredictable errors that plague LLMs. Acknowledging the inevitability of these errors, we propose data quality assertions to identify when LLMs may be making mistakes. We present spade, a method for automatically synthesizing data quality assertions that identify bad LLM outputs. We make the observation that developers often identify data quality issues during prototyping prior to deployment, and attempt to address them by adding instructions to the LLM prompt over time. spade therefore analyzes histories of prompt versions over time to create candidate assertion functions and then selects a minimal set that fulfills both coverage and accuracy requirements. In testing across nine different real-world LLM pipelines, spade efficiently reduces the number of assertions by 14% and decreases false failures by 21% when compared to simpler baselines. spade has been deployed as an offering within LangSmith, LangChain's LLM pipeline hub, and has been used to generate data quality assertions for over 2000 pipelines across a spectrum of industries.
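
As an illustration of what a synthesized assertion might look like, suppose the prompt history at some point gained an instruction such as "respond in JSON with a non-empty 'summary' under 100 words". The function below checks exactly those prompt-derived constraints; its name and the fields it inspects are hypothetical, not spade's actual output.

# A minimal sketch of a prompt-derived data quality assertion for LLM outputs.
import json

def assert_summary_format(llm_output: str) -> bool:
    """Return True if the LLM output satisfies the prompt-derived constraints."""
    try:
        payload = json.loads(llm_output)
    except json.JSONDecodeError:
        return False                      # instruction: output must be valid JSON
    summary = payload.get("summary", "")
    if not isinstance(summary, str) or not summary.strip():
        return False                      # instruction: 'summary' must be non-empty
    return len(summary.split()) < 100     # instruction: keep it under 100 words

# A pipeline would run a minimal, high-coverage set of such assertions on every
# output and route failures to retries or human review.
print(assert_summary_format('{"summary": "Quarterly revenue grew 12%."}'))  # True
print(assert_summary_format('not json at all'))                             # False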
-
Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer tree training performance competitive with specialized ML libraries... with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting by updating the Y variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of the variance semi-ring, to support RMSE, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, database size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
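
The residual-as-projection idea can be sketched on DuckDB: instead of updating the fact table in place, the residual becomes a derived column (here a view), and split candidates are scored from variance-semi-ring aggregates (count, sum, sum of squares). The tables, columns, and single boosting round below are toy examples, not JoinBoost's full factorized training loop.

# A minimal sketch of "residual as a new column/projection" plus variance-semi-ring stats.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE fact(id INT, feature INT, y DOUBLE)")
con.execute("INSERT INTO fact VALUES (1, 0, 3.0), (2, 0, 5.0), (3, 1, 8.0), (4, 1, 10.0)")

# Round 0: the prediction is the global mean; residuals are exposed as a derived
# column over the base table rather than by rewriting it in place.
con.execute("""
CREATE VIEW fact_r1 AS
SELECT *, y - (SELECT AVG(y) FROM fact) AS residual
FROM fact
""")

# Variance-semi-ring aggregates per candidate split value; the best split minimizes
# SUM(residual^2) - SUM(residual)^2 / COUNT(*).
stats = con.execute("""
SELECT feature,
       COUNT(*)               AS cnt,
       SUM(residual)          AS s,
       SUM(residual*residual) AS q
FROM fact_r1
GROUP BY feature
""").fetchall()
print(stats)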